OcrV1, Main, Exploration, bibRecord, 002033

Prototype extraction and adaptive OCR

Internal identifier: 002033 (Main/Exploration); previous: 002032; next: 002034

Authors: Y. Xu [United States]; George Nagy (computer scientist) [United States]

Source:

IEEE Transactions on Pattern Analysis and Machine Intelligence [ISSN 0162-8828]; 1999.

RBID: Pascal:00-0247935

French descriptors

Pascal (Inist)
- Théorie, Extraction caractéristique, Qualité image, Segmentation image, Algorithme, Concordance forme, Programmation dynamique, Analyse image, Reconnaissance optique caractère, Expérience.

English descriptors

KwdEn:
- Adaptive classification, Algorithms, Dynamic programming, Experiments, Feature extraction, Image analysis, Image quality, Image segmentation, Optical character recognition, Pattern matching, Template matching, Text reader, Theory.

Abstract

To maintain OCR accuracy with decreasing quality of page image composition, production, and digitization, it is essential to tune the system to each document. We propose a prototype extraction method for document-specific OCR systems. The method automatically generates training samples from unsegmented text images and the corresponding transcripts. It is tolerant of transcription errors, so a transcript produced automatically by an imperfect omnifont OCR system can be used. The method is based on new algorithms for estimating character widths, character locations in a word, and match/nonmatch probabilities from unsegmented text. An experimental word recognition system is designed and developed to combine prototype extraction algorithms and segmentation-free word recognition. The system can adapt itself to different page images and achieve high recognition accuracy on heavily degraded print.

Links to previous steps (curation, corpus...)

to stream PascalFrancis, to step Corpus: 000781
to stream PascalFrancis, to step Curation: 000013
to stream PascalFrancis, to step Checkpoint: 000763
to stream Main, to step Merge: 002144
to stream Main, to step Curation: 002033

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Prototype extraction and adaptive OCR</title>
<author><name sortKey="Xu, Y" sort="Xu, Y" uniqKey="Xu Y" first="Y." last="Xu">Y. Xu</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Hewlett-Packard Lab</s1>
<s2>Palo Alto CA</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Hewlett-Packard Lab</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Nagy, G" sort="Nagy, G" uniqKey="Nagy G" first="G." last="Nagy">George Nagy (informaticien)</name>
<affiliation><country>États-Unis</country>
<placeName><settlement type="city">Troy (New York)</settlement>
<region type="state">État de New York</region>
</placeName>
<orgName type="lab" n="5">Institut polytechnique Rensselaer</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">00-0247935</idno>
<date when="1999">1999</date>
<idno type="stanalyst">PASCAL 00-0247935 EI</idno>
<idno type="RBID">Pascal:00-0247935</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000781</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000013</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000763</idno>
<idno type="wicri:doubleKey">0162-8828:1999:Xu Y:prototype:extraction:and</idno>
<idno type="wicri:Area/Main/Merge">002144</idno>
<idno type="wicri:Area/Main/Curation">002033</idno>
<idno type="wicri:Area/Main/Exploration">002033</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Prototype extraction and adaptive OCR</title>
<author><name sortKey="Xu, Y" sort="Xu, Y" uniqKey="Xu Y" first="Y." last="Xu">Y. Xu</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Hewlett-Packard Lab</s1>
<s2>Palo Alto CA</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Hewlett-Packard Lab</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Nagy, G" sort="Nagy, G" uniqKey="Nagy G" first="G." last="Nagy">George Nagy (informaticien)</name>
<affiliation><country>États-Unis</country>
<placeName><settlement type="city">Troy (New York)</settlement>
<region type="state">État de New York</region>
</placeName>
<orgName type="lab" n="5">Institut polytechnique Rensselaer</orgName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
<title level="j" type="abbreviated">IEEE Trans Pattern Anal Mach Intell</title>
<idno type="ISSN">0162-8828</idno>
<imprint><date when="1999">1999</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
<title level="j" type="abbreviated">IEEE Trans Pattern Anal Mach Intell</title>
<idno type="ISSN">0162-8828</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Adaptive classification</term>
<term>Algorithms</term>
<term>Dynamic programming</term>
<term>Experiments</term>
<term>Feature extraction</term>
<term>Image analysis</term>
<term>Image quality</term>
<term>Image segmentation</term>
<term>Optical character recognition</term>
<term>Pattern matching</term>
<term>Template matching</term>
<term>Text reader</term>
<term>Theory</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Théorie</term>
<term>Extraction caractéristique</term>
<term>Qualité image</term>
<term>Segmentation image</term>
<term>Algorithme</term>
<term>Concordance forme</term>
<term>Programmation dynamique</term>
<term>Analyse image</term>
<term>Reconnaissance optique caractère</term>
<term>Expérience</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">To maintain OCR accuracy with decreasing quality of page image composition,  production, and digitization, it is essential to tune the system to each document.  We propose a prototype extraction method for document-specific OCR systems.  The method automatically generates training samples from unsegmented text images and the corresponding transcripts.  It is tolerant of transcription errors, so a transcript produced automatically by an imperfect omnifont OCR system can be used.  The method is based on new algorithms for estimating character widths, character locations in a word, and match/nonmatch probabilities from unsegmented text.  An experimental word recognition system is designed and developed to combine prototype extraction algorithms and segmentation-free word recognition.  The system can adapt itself to different page images and achieve high recognition accuracy on heavily degraded print.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>État de New York</li>
</region>
<settlement><li>Troy (New York)</li>
</settlement>
<orgName><li>Institut polytechnique Rensselaer</li>
</orgName>
</list>
<tree><country name="États-Unis"><noRegion><name sortKey="Xu, Y" sort="Xu, Y" uniqKey="Xu Y" first="Y." last="Xu">Y. Xu</name>
</noRegion>
<name sortKey="Nagy, G" sort="Nagy, G" uniqKey="Nagy G" first="G." last="Nagy">George Nagy (informaticien)</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002033 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 002033 | SxmlIndent | more
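
Once the record has been extracted, its title and descriptors can also be read programmatically. The following is a minimal sketch in Python, not part of Dilib. It assumes a simplified, namespace-free excerpt of the record shown above: the `wicri:` and `inist:` elements are left out because their namespaces are not declared in the record and a strict XML parser would reject them. The names `RECORD`, `title_of`, and `keywords_by_scheme` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified excerpt of the record above (assumption: prefixed elements
# such as wicri:* and inist:* have been dropped, and only a few terms kept).
RECORD = """<record><TEI><teiHeader><fileDesc><titleStmt>
<title level="a">Prototype extraction and adaptive OCR</title>
</titleStmt></fileDesc>
<profileDesc><textClass>
<keywords scheme="KwdEn"><term>Dynamic programming</term>
<term>Optical character recognition</term>
<term>Template matching</term>
</keywords>
<keywords scheme="Pascal"><term>Programmation dynamique</term>
</keywords>
</textClass></profileDesc>
</teiHeader></TEI></record>"""


def title_of(xml_text):
    """Return the article title stored under titleStmt."""
    return ET.fromstring(xml_text).findtext(".//titleStmt/title")


def keywords_by_scheme(xml_text):
    """Map each keyword scheme (e.g. KwdEn, Pascal) to its list of terms."""
    root = ET.fromstring(xml_text)
    result = {}
    for keywords in root.iter("keywords"):
        result[keywords.get("scheme")] = [t.text for t in keywords.iter("term")]
    return result


if __name__ == "__main__":
    print(title_of(RECORD))
    print(keywords_by_scheme(RECORD))
```

To apply this to the full record, the undeclared prefixes would first have to be declared or stripped; how best to do that depends on the local Dilib setup.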

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:00-0247935
   |texte=   Prototype extraction and adaptive OCR
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

	Exploration server on OCR
	Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.